LLM 25-Day Course - Day 12: Introduction to Hugging Face and Its Ecosystem

Hugging Face is often called the GitHub of the AI/ML community. It is where models, datasets, and demos are shared, and its Transformers library lets you run most of those models in just a few lines of code.

Components of the Hugging Face Ecosystem

Service	Role	URL / Install command
Hub (Models)	Repository of more than 500,000 models	huggingface.co/models
Hub (Datasets)	More than 100,000 datasets	huggingface.co/datasets
Spaces	Demo app hosting (Gradio, Streamlit)	huggingface.co/spaces
Transformers	Library for loading models and running inference	pip install transformers
Datasets	Library for loading and processing datasets	pip install datasets
PEFT	Parameter-efficient fine-tuning (LoRA, etc.)	pip install peft
TRL	Library for RLHF/DPO training	pip install trl
Accelerate	Multi-GPU/TPU training	pip install accelerate

Creating an Account and Setting Up a Token

# Step 1: Create an account at https://huggingface.co
# Step 2: Issue a token at https://huggingface.co/settings/tokens
# Step 3: Log in using one of the options below

# Option 1: CLI login
# pip install huggingface_hub
# huggingface-cli login

# Option 2: Log in from Python
from huggingface_hub import login

# Placeholder token; prefer the HF_TOKEN environment variable over hardcoding a real one
login(token="hf_YOUR_TOKEN_HERE")

# Option 3: Set an environment variable (.env file)
# HF_TOKEN=hf_YOUR_TOKEN_HERE
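
The environment-variable option above can be sketched as follows. This is a minimal example, not the library's prescribed method: the helper names `get_hf_token` and `login_from_env` are hypothetical and introduced only for illustration, while `login` is the same function used in Option 2.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in the given environment variable.

    Raises RuntimeError if the variable is missing or empty, so a
    misconfigured shell fails early instead of at the first Hub request.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token


def login_from_env() -> None:
    """Log in to the Hub with the token read from the environment."""
    # Imported lazily so get_hf_token() works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=get_hf_token())


# Usage (needs a real token and network access):
# login_from_env()
```

Keeping the token out of the source file means it cannot be committed to a repository by accident. Recent versions of huggingface_hub also pick up HF_TOKEN on their own, so the explicit login call is often optional.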

Using Models with Transformers

# pip install transformers torch
from transformers import pipeline

# Sentiment analysis (done in one line!)
# With no model argument, the pipeline falls back to a default model for the task.
classifier = pipeline("sentiment-analysis")
result = classifier("I love learning about LLMs!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator("The future of AI is", max_new_tokens=30)
print(output[0]["generated_text"])

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("How are you today?")
print(result)  # e.g. [{'translation_text': "Comment allez-vous aujourd'hui ?"}]

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is Hugging Face?",
    context="Hugging Face is a platform for sharing AI models and datasets.",
)
print(f"Answer: {result['answer']} (confidence: {result['score']:.2%})")
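
A classification pipeline also accepts a list of texts and returns one `{'label', 'score'}` dict per input, in the same order. The sketch below shows one way to post-process that output. The helper names `confident_predictions` and `classify` and the 0.9 threshold are assumptions made for illustration, and the model ID is pinned explicitly on the assumption that it matches the task's usual default.

```python
from typing import Dict, List, Tuple, Union

Prediction = Dict[str, Union[str, float]]


def confident_predictions(
    texts: List[str],
    predictions: List[Prediction],
    threshold: float = 0.9,
) -> List[Tuple[str, str, float]]:
    """Pair each text with its prediction and keep those at or above threshold."""
    return [
        (text, pred["label"], pred["score"])
        for text, pred in zip(texts, predictions)
        if pred["score"] >= threshold
    ]


def classify(texts: List[str]) -> List[Prediction]:
    """Run sentiment analysis on a batch of texts with an explicitly pinned model."""
    # Imported lazily: loading transformers and downloading the model is slow.
    from transformers import pipeline

    classifier = pipeline(
        "sentiment-analysis",
        # Assumed model ID; pinning it keeps results stable if the default changes.
        model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    )
    return classifier(texts)


# Usage (downloads the model on first run):
# texts = ["I love learning about LLMs!", "This tutorial is okay, I guess."]
# print(confident_predictions(texts, classify(texts)))
```

Separating the model call from the filtering step keeps the filtering logic easy to check with hand-written predictions.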

The Datasets Library

# pip install datasets
from datasets import load_dataset

# Load a well-known dataset (downloaded automatically and cached)
dataset = load_dataset("squad", split="train[:100]")
print(f"Number of examples: {len(dataset)}")
print(f"Columns: {dataset.column_names}")
print(f"First example: {dataset[0]['question']}")

# A Korean dataset
# (dataset IDs on the Hub can change; if this one does not resolve, search the Hub for its current name)
ko_dataset = load_dataset("kor_nlu", "sts", split="train[:50]")
print(f"\nKorean NLU data: {len(ko_dataset)} examples")

# Preprocess the dataset: add a column with each question's length
def preprocess(example):
    example["question_length"] = len(example["question"])
    return example

processed = dataset.map(preprocess)
print(f"Average question length: {sum(processed['question_length']) / len(processed):.0f} characters")
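
The `map` call above runs the function once per example. As a sketch of the alternative, `map` can also be called with `batched=True`, in which case the function receives a dict of column lists and returns new columns as lists. The function names `add_question_length` and `load_and_process` are hypothetical.

```python
from typing import Dict, List


def add_question_length(batch: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Batched map function: takes columns as lists, returns the new column as a list."""
    return {"question_length": [len(question) for question in batch["question"]]}


def load_and_process():
    """Load the same SQuAD slice as above and add the length column in batches."""
    # Imported lazily: this step needs the datasets package and network access.
    from datasets import load_dataset

    dataset = load_dataset("squad", split="train[:100]")
    # The returned dict is merged into the existing columns.
    return dataset.map(add_question_length, batched=True)


# Usage:
# processed = load_and_process()
# print(processed.column_names)
```

Batched processing usually matters most for expensive steps such as tokenization, where handling many examples per call reduces overhead.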

Searching and Downloading Models from the Hub

from huggingface_hub import HfApi

api = HfApi()

# Search for Korean-related models, most downloaded first
models = list(api.list_models(
    search="korean",
    sort="downloads",
    direction=-1,  # -1 = descending
    limit=5,
))

print("Popular Korean-related models:")
for model in models:
    print(f"  {model.id} (downloads: {model.downloads:,})")

# Look up information about a specific model
# (this repository is gated: downloading its files requires accepting the license on the Hub)
model_info = api.model_info("meta-llama/Meta-Llama-3.1-8B-Instruct")
print(f"\nModel: {model_info.id}")
print(f"Downloads: {model_info.downloads:,}")
print(f"Likes: {model_info.likes:,}")
print(f"Tags: {model_info.tags[:5]}")
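
The code above covers searching; downloading could look like the sketch below. It uses `hf_hub_download` for a single file and `snapshot_download` for a filtered set of files from huggingface_hub. The wrapper names `download_config` and `download_metadata`, and the choice of the small, ungated `gpt2` repository, are assumptions made for illustration.

```python
def download_config(repo_id: str = "gpt2", filename: str = "config.json") -> str:
    """Download one file from a model repo and return its local cache path."""
    # Imported lazily: the call needs huggingface_hub and network access.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def download_metadata(repo_id: str = "gpt2") -> str:
    """Download only the JSON files of a repo and return the local folder path."""
    from huggingface_hub import snapshot_download

    # allow_patterns keeps this sketch small by skipping the large weight files.
    return snapshot_download(repo_id=repo_id, allow_patterns=["*.json"])


# Usage (needs network access; gated repos also need a token with approved access):
# print(download_config())
# print(download_metadata())
```

Both functions store files in the local Hugging Face cache, so a repeated call reuses what was already downloaded.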

Building a Demo with Hugging Face Spaces

Spaces is a service that hosts Gradio or Streamlit apps, with a free tier for basic demos.

# pip install gradio
import gradio as gr
from transformers import pipeline

# A simple sentiment analysis demo
classifier = pipeline("sentiment-analysis")

def analyze_sentiment(text):
    # The pipeline returns a list with one result per input; take the first.
    result = classifier(text)[0]
    return f"{result['label']} (confidence: {result['score']:.2%})"

demo = gr.Interface(
    fn=analyze_sentiment,
    inputs=gr.Textbox(label="Text input", placeholder="Enter a sentence to analyze"),
    outputs=gr.Textbox(label="Sentiment analysis result"),
    title="Sentiment Analysis Demo",
    description="Analyzes the sentiment (positive/negative) of a text.",
)

demo.launch()
# To deploy to Spaces: run huggingface-cli repo create, then git push
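
As an alternative to the CLI-plus-git route mentioned in the comment above, a Space can also be created and filled from Python with `HfApi.create_repo` and `HfApi.upload_folder`. The sketch below assumes the demo code is saved as `app.py` next to a `requirements.txt` listing its dependencies; the helper names `missing_space_files` and `deploy_space` and the example repo ID are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumption for this sketch: the demo lives in app.py and its dependencies
# (for example transformers and torch) are listed in requirements.txt.
EXPECTED_FILES = ("app.py", "requirements.txt")


def missing_space_files(folder_path: str) -> List[str]:
    """Return the expected files that are not present in the folder."""
    folder = Path(folder_path)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


def deploy_space(repo_id: str, folder_path: str = ".") -> None:
    """Create a Gradio Space (if needed) and upload the folder contents to it."""
    missing = missing_space_files(folder_path)
    if missing:
        raise FileNotFoundError(f"Missing files in {folder_path}: {missing}")

    # Imported lazily: the upload needs huggingface_hub, a token, and network access.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="space")


# Usage (hypothetical repo ID; requires being logged in):
# deploy_space("your-username/sentiment-demo", folder_path="./my_demo")
```

Checking for the expected files before uploading avoids creating a Space that fails to build because the app file or its dependencies are missing.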

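Hugging Face is essential infrastructure for LLM development. Downloading models, fine-tuning them, and deploying them can all be handled within this ecosystem. Starting next week, we will use these tools to begin hands-on projects.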

Today's Exercises

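Create a Hugging Face account and issue a token. Then run GPT-2 with pipeline("text-generation") and compare its output for Korean and English prompts.
Use the datasets library to find and load a Korean dataset, then print its structure and its first 5 samples.
Build a simple text summarization demo with Gradio. pipeline("summarization") is a good starting point.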